feat: evaluation ingestion (no user-facing feature is added) #1764

RogerHYang · 2023-11-16T17:30:24Z

Purpose

Provides a Jupyter notebook and experimental helper functions for:
- extracting spans from the Session
- running evals on those spans
- ingesting eval results back into the Session

Adds internal data structures for evaluation ingestion and GraphQL queries.

Changes

- Defines proto for Evaluation
- Adds Evaluation to HttpExporter
- Adds fixture parquet files for evaluations
- Adds http endpoint and receiver for Evaluation
- Adds px.core.evals.Evals (analogous to px.core.traces.Traces) to store the received Evaluation
- Attaches px.core.evals.Evals to px.session.session.Session
- Adds evaluations to the spans query in GraphQL
- Adds GraphQL query to retrieve all available span evaluation names
- Adds Jupyter Notebook for ingesting evaluations after running llm_classify
- Adds helper functions for the Jupyter Notebook, e.g. to extract spans from the Phoenix session

Caveats

Duplicate evaluations, when ingested a second time, overwrite the existing ones.
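
The overwrite behavior can be pictured as a keyed store in which the evaluation name plus its subject (a span ID, optionally with a document position) forms the key. This is only a minimal sketch: `EvalStore`, its key layout, and the sample values are hypothetical and do not reflect the actual px.core.evals.Evals implementation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (evaluation name, span ID, document position or None).
Key = Tuple[str, str, Optional[int]]


class EvalStore:
    """Toy store illustrating last-write-wins ingestion."""

    def __init__(self) -> None:
        self._results: Dict[Key, float] = {}

    def ingest(
        self,
        name: str,
        span_id: str,
        score: float,
        document_position: Optional[int] = None,
    ) -> None:
        # Assigning to an existing key replaces the earlier result.
        self._results[(name, span_id, document_position)] = score

    def get(
        self,
        name: str,
        span_id: str,
        document_position: Optional[int] = None,
    ) -> Optional[float]:
        return self._results.get((name, span_id, document_position))


store = EvalStore()
store.ingest("relevance", "span-1", 0.2)
store.ingest("relevance", "span-1", 0.9)  # same name and subject: overwrites
```

After the second call, only the later score remains for that name and span.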

GraphQL Sample Output

Use the trace fixture llama_index_rag.

Span Evaluation Names

GraphQL Query

query Query {
  spanEvaluationNames
}

Span Evaluations and Document Evaluations

GraphQL Query

query Query {
  spans(filterCondition:"name == 'query' or span_kind == 'RETRIEVER'") {
    edges {
      node {
        name
        context {
          spanId
        }
        input {
          value
        }
        spanEvaluations {
          name
          score
          label
          explanation
        }
        documentEvaluations {
          name
          documentPosition
          score
          label
          explanation
        }
      }
    }
  }
}
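
To try the queries above from Python rather than from a GraphQL UI, something like the following sketch could work. It only builds the HTTP request. The endpoint URL is an assumption about a locally running server, and `build_request` is a hypothetical helper, not part of Phoenix.

```python
import json
import urllib.request

# Assumption: the local Phoenix server exposes GraphQL at this URL.
GRAPHQL_URL = "http://localhost:6006/graphql"

QUERY = """
query Query {
  spanEvaluationNames
}
"""


def build_request(query: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(QUERY)
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```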

mikeldking

Niiice! Left some comments but you walked me through it so will give others a chance to take a look before stamping. Ping me tomorrow.

mikeldking · 2023-11-16T22:27:57Z

app/schema.graphql

+  label: String
+  explanation: String
+  spanId: String!
+  documentPosition: Int!


docs: Adding descriptions for the fields would be useful

mikeldking · 2023-11-16T22:29:17Z

app/schema.graphql

+  score: Float
+  label: String
+  explanation: String
+  spanId: String!


Did you mean to expose the spanId here? If so this becomes a bit less generic. Totally a nit.

RogerHYang · 2023-11-16T23:34:26Z

Good catch! I was just copy/pasting what I had in mind for the Python code.

mikeldking · 2023-11-16T22:30:40Z

app/schema.graphql

+  score: Float
+  label: String
+  explanation: String
+  spanId: String!


Same as above - maybe this was for troubleshooting but not needed I think?

RogerHYang · 2023-11-16T23:38:14Z

yup. will remove. thx

mikeldking · 2023-11-16T22:33:48Z

src/phoenix/server/api/types/Evaluation.py

+
+@strawberry.type
+class SpanEvaluation(Evaluation):
+    span_id: str


I think you can mark this as private

mikeldking · 2023-11-16T22:34:37Z

src/phoenix/server/api/types/Span.py

@@ -122,6 +124,14 @@ class Span:
        description="Cumulative (completion) token count from self and all "
        "descendant spans (children, grandchildren, etc.)",
    )
+    span_evaluations: List[SpanEvaluation] = strawberry.field(
+        description="Span evaluations",


Nit, no need to repeat the name - best to be more verbose and informative in the descriptions.

RogerHYang · 2023-11-16T23:38:45Z

yea, that makes sense

mikeldking · 2023-11-16T22:39:18Z

src/phoenix/server/api/types/Span.py

+    span_evaluations: List[SpanEvaluation] = []
+    document_evaluations: List[DocumentEvaluation] = []
+    span_id = span.context.span_id
+    for evaluation in evals.get_evaluations_by_span_id(span_id) if evals else ():
+        span_evaluations.append(SpanEvaluation.from_pb_evaluation(evaluation))
+    for evaluation in evals.get_document_evaluations_by_span_id(span_id) if evals else ():
+        document_evaluations.append(DocumentEvaluation.from_pb_evaluation(evaluation))


optimization: You can place this code of getting evaluations and documents on the Span node. This will alleviate the load of the query if the query doesn't ask for anything related to evals. If there are multiple fields that require evaluations by span_id, you can wrap that in a dataloader to eliminate the n+1. Can do this as a follow-up but it's a worthwhile refactor as it's always good to eliminate over-fetching if possible.

Some links to dataloading: https://leebyron.com/dataloader-v2/

RogerHYang · 2023-11-16T23:40:52Z

oh yea, that's right. we had talked about this before. I totally forgot about it
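
The batching idea behind a dataloader can be sketched with the standard library alone. `BatchLoader` and `fetch_evals` below are hypothetical names used for illustration; Strawberry provides its own `DataLoader` in `strawberry.dataloader` for the same purpose, and a real refactor would more likely use that. The sketch only shows how lookups requested in the same event-loop tick collapse into one batched call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


class BatchLoader:
    """Collects keys requested in the same event-loop tick and loads them once."""

    def __init__(
        self, batch_fn: Callable[[List[str]], Awaitable[List[List[str]]]]
    ) -> None:
        self._batch_fn = batch_fn
        self._pending: Dict[str, asyncio.Future] = {}
        self._task: Optional[asyncio.Task] = None

    def load(self, key: str) -> asyncio.Future:
        loop = asyncio.get_running_loop()
        if key not in self._pending:
            if not self._pending:
                # First key of a new batch: schedule a single dispatch.
                loop.call_soon(self._start_dispatch)
            self._pending[key] = loop.create_future()
        return self._pending[key]

    def _start_dispatch(self) -> None:
        # Keep a reference so the task is not garbage-collected mid-flight.
        self._task = asyncio.get_running_loop().create_task(self._dispatch())

    async def _dispatch(self) -> None:
        pending, self._pending = self._pending, {}
        keys = list(pending)
        try:
            results = await self._batch_fn(keys)
        except Exception as exc:
            for future in pending.values():
                future.set_exception(exc)
            return
        for key, result in zip(keys, results):
            pending[key].set_result(result)


calls: List[List[str]] = []


async def fetch_evals(span_ids: List[str]) -> List[List[str]]:
    # Hypothetical batched lookup; records each call to show the batching.
    calls.append(span_ids)
    return [[f"eval-for-{span_id}"] for span_id in span_ids]


async def main() -> List[List[str]]:
    loader = BatchLoader(fetch_evals)
    return await asyncio.gather(*(loader.load(s) for s in ["a", "b", "c"]))


results = asyncio.run(main())
```

Three `load` calls produce a single `fetch_evals` call, which is how the n+1 pattern is avoided when many span nodes resolve their evaluations in one query.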

mikeldking · 2023-11-16T22:45:45Z

src/phoenix/server/main.py

+        Thread(
+            target=_load_items,
+            args=(evals, fixture_evals, simulate_streaming),
+            daemon=True,
+        ).start()


future thought: I might need to have a "dirty" bit to know when evals change so I know how to refetch due to the lack of subscriptions :(

RogerHYang · 2023-11-16T23:41:22Z

not sure how exactly. let's catch up later
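
One possible shape for such a "dirty" signal is a version counter that a polling client compares against the value it last saw. The thread leaves the design open, so this is purely an illustrative sketch; `VersionedStore` and `needs_refetch` are hypothetical names.

```python
import threading
from typing import List


class VersionedStore:
    """Toy store exposing a version counter that clients can poll."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: List[str] = []
        self._version = 0

    def add(self, item: str) -> None:
        with self._lock:
            self._items.append(item)
            self._version += 1  # any write marks the data as changed

    @property
    def version(self) -> int:
        with self._lock:
            return self._version


def needs_refetch(store: VersionedStore, last_seen_version: int) -> bool:
    """Return True when the store changed since the client's last fetch."""
    return store.version != last_seen_version


store = VersionedStore()
seen = store.version
store.add("evaluation-1")
```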

mikeldking · 2023-11-16T22:49:14Z

src/phoenix/session/evaluaton.py

+    for index, row in evaluations.iterrows():
+        subject_id = _extract_subject_id(cast(Union[str, Tuple[str]], index), index_names)
+        result = _extract_result(row)
+        evaluation = pb.Evaluation(
+            name=evaluation_name,
+            result=result,
+            subject_id=subject_id,
+        )
+        exporter.export(evaluation)


optimization: this is pretty inefficient overall - I get that we are trying to leverage existing code but I think uploading this in bulk / chunks feels much more practical?

RogerHYang · 2023-11-16T23:36:21Z

yup, totally agree. I was just trying to limit the scope of this PR. Will definitely upgrade in a future PR
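
A chunked upload could start from a helper like the one below, which groups rows so that one export call is made per chunk instead of per row. This is a sketch under assumptions: `chunked` is a hypothetical helper, the strings stand in for evaluation messages, and a bulk-capable exporter endpoint is not part of this PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for per-row evaluation messages.
evaluations = [f"evaluation-{i}" for i in range(5)]
batches = list(chunked(evaluations, size=2))
```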

mikeldking · 2023-11-16T22:51:45Z

src/phoenix/session/evaluaton.py

+    if index_names and index_names[0].endswith("span_id"):
+        if len(index_names) == 2 and index_names[1].endswith("document_position"):


there's a bit of magic here that would not be intuitive to the reader. Can you figure out how to maybe leverage variable names and a bit of doc-strings to make this easier to grok? Without being intimately familiar with the structure of the data, I think this will go over a reader's head.

RogerHYang · 2023-11-16T23:37:28Z

agree. will add docstring
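
As an illustration of what a documented version might look like, the sketch below names the convention explicitly. `classify_index`, its return values, and the sample index names are hypothetical. Treating a single `span_id` level as a span-level evaluation is an assumption, since the diff excerpt only shows the two-level check.

```python
from typing import Optional, Sequence


def classify_index(index_names: Sequence[Optional[str]]) -> str:
    """Infer what an evaluations table is about from its index names.

    - ["...span_id"] is taken to mean one evaluation per span.
    - ["...span_id", "...document_position"] is taken to mean one
      evaluation per retrieved document within a span.
    """
    names = [name or "" for name in index_names]
    if names and names[0].endswith("span_id"):
        if len(names) == 1:
            return "span"
        if len(names) == 2 and names[1].endswith("document_position"):
            return "document"
    raise ValueError(f"unsupported index names: {list(index_names)}")


# Assumed example index names; real column names may differ.
span_kind = classify_index(["context.span_id"])
document_kind = classify_index(["context.span_id", "document_position"])
```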

mikeldking · 2023-11-16T22:55:30Z

Minor optimization on the PR comment - Might help if you put graphql codeblocks so others can copy / paste and try out the queries for themselves.

RogerHYang

added sample graphql queries to the PR description


axiomofjoy · 2023-11-20T03:18:10Z

src/phoenix/session/evaluaton.py

@@ -0,0 +1,102 @@
+from typing import Any, Iterable, List, Mapping, Optional, Tuple, Union, cast


Typo in the name of this file.

very good eye! thanks for the catch!

axiomofjoy · 2023-11-20T03:27:47Z

src/phoenix/server/evaluation_handler.py

+        except Exception:
+            return Response(status_code=HTTP_422_UNPROCESSABLE_ENTITY)


Should we return a 500 here instead of a 422?

good catch! it's intended for line 31, so this is actually not the right place to put it

i also confirmed that 500 is automatic via starlette: we don't need to catch anything for it
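
The distinction discussed here, a 422 only for a body that fails to decode and a 500 for everything else, can be sketched without any web framework. JSON decoding stands in for the protobuf parsing step, and `handle` and `store` are hypothetical names; this is not the actual handler code.

```python
import json
from typing import Any, Dict, Tuple

HTTP_200_OK = 200
HTTP_422_UNPROCESSABLE_ENTITY = 422


def store(evaluation: Dict[str, Any]) -> None:
    """Hypothetical downstream step; bugs here should surface as 500s."""


def handle(body: bytes) -> Tuple[int, str]:
    try:
        # Only the decoding step is guarded: malformed input is a client error.
        evaluation = json.loads(body)
    except ValueError:
        return HTTP_422_UNPROCESSABLE_ENTITY, "request body failed to decode"
    # Not wrapped in try/except: an unexpected error propagates to the
    # framework, which is expected to answer with a 500.
    store(evaluation)
    return HTTP_200_OK, "ok"


ok_response = handle(b'{"name": "relevance"}')
bad_response = handle(b"\xff")
```

Keeping the `try` block narrow avoids mislabeling server-side bugs as client errors.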

axiomofjoy · 2023-11-20T04:50:02Z

looking really good @RogerHYang

RogerHYang added 2 commits November 16, 2023 09:27

wip

d53b5b6

recompile proto

55a4eb3

Arize-ai deleted a comment from review-notebook-app bot Nov 16, 2023

RogerHYang added 4 commits November 16, 2023 13:21

Merge branch 'main' into evaluation-ingestion

2baa268

Merge branch 'main' into evaluation-ingestion

313f64e

simulate streaming

d8ba773

remove unused functions

ea44b31

RogerHYang marked this pull request as ready for review November 16, 2023 22:09

RogerHYang changed the title ~~feat: evaluation ingestion~~ feat: evaluation ingestion (no user-facing feature is added) Nov 16, 2023

mikeldking reviewed Nov 16, 2023

RogerHYang commented Nov 16, 2023

mikeldking approved these changes Nov 18, 2023

axiomofjoy reviewed Nov 20, 2023

RogerHYang added 13 commits November 20, 2023 07:41

Merge branch 'main' into evaluation-ingestion

2909a3a

fix typo

059e3ad

clean up gql

0514c6f

fix receiver

196963e

add back dropped param

ae753d9

clean up notebook

fc48438

improve gql descriptions

7408f38

clean up gql

1ab6acc

Merge branch 'main' into evaluation-ingestion

4526ecc

fix typo

0b7fded

fix typo

15f71f5

handle missing values as result of interrupts

ada921e

fix format

7357c6b

RogerHYang merged commit 7c4039b into main Nov 21, 2023
10 checks passed

RogerHYang deleted the evaluation-ingestion branch November 21, 2023 00:29

github-actions bot mentioned this pull request Nov 21, 2023

chore(main): release 1.3.0 #1787

Merged

github-actions bot mentioned this pull request Feb 16, 2024

chore(main): release phoenix 4.0.0 #2321

Closed
